Abstract

In supervised binary hashing, one wants to learn a function that maps a high-dimensional feature 
vector to a vector of binary codes, for application to fast image retrieval. This typically results in a 
difficult optimization problem, nonconvex and nonsmooth, because of the discrete variables involved. 
Much work has simply relaxed the problem during training, solving a continuous optimization, and 
truncating the codes a posteriori. This gives reasonable results but is quite suboptimal. Recent work has 
tried to optimize the objective directly over the binary codes and achieved better results, but the hash 
function was still learned a posteriori, which remains suboptimal. We propose a general framework for 
learning hash functions using affinity-based loss functions that uses auxiliary coordinates. This closes the 
loop and optimizes jointly over the hash functions and the binary codes so that they gradually match each 
other. The resulting algorithm can be seen as a corrected, iterated version of the procedure of optimizing 
first over the codes and then learning the hash function. Compared to this, our optimization is guaranteed 
to obtain better hash functions while being not much slower, as demonstrated experimentally in various 
supervised datasets. In addition, our framework facilitates the design of optimization algorithms for 
arbitrary types of loss and hash functions. 


1 Introduction 


Information retrieval arises in several applications, most obviously web search. For example, in image 
retrieval, a user is interested in finding similar images to a query image. Computationally, this essentially 
involves defining a high-dimensional feature space where each relevant image is represented by a vector, 
and then finding the closest points (nearest neighbors) to the vector for the query image, according to
a suitable distance (Shakhnarovich et al., 2006). For example, one can use features such as SIFT (Lowe,
2004) or GIST (Oliva and Torralba, 2001) and the Euclidean distance for this purpose. Finding nearest
neighbors in a dataset of $N$ images (where $N$ can be millions), each a vector of dimension $D$ (typically in
the hundreds) is slow, since exact algorithms run essentially in time $O(ND)$ and space $O(ND)$ (to store
the image dataset). In practice, this is approximated, and a successful way to do this is binary hashing
(Grauman and Fergus, 2013). Here, given a high-dimensional vector $\mathbf{x} \in \mathbb{R}^D$, the hash function $\mathbf{h}$ maps it
to a $b$-bit vector $\mathbf{z} = \mathbf{h}(\mathbf{x}) \in \{-1,+1\}^b$, and the nearest neighbor search is then done in the binary space.
This now costs $O(Nb)$ time and space, which is orders of magnitude faster because typically $b < D$ and,
crucially, (1) operations with binary vectors (such as computing Hamming distances) are very fast because 
of hardware support, and (2) the entire dataset can fit in (fast) memory rather than slow memory or disk. 
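
The speed argument above can be illustrated with a minimal sketch, assuming each code in $\{-1,+1\}^b$ is packed into an integer so that the Hamming distance reduces to an XOR followed by a bit count. The function names (`pack_code`, `hamming`, `hamming_knn`) are hypothetical and are not taken from the paper; a real system would use hardware popcount instructions rather than Python.

```python
from typing import List, Sequence


def pack_code(bits: Sequence[int]) -> int:
    """Pack a code in {-1,+1}^b into an integer (a bit is set where the entry is +1)."""
    word = 0
    for bit in bits:
        word = (word << 1) | (1 if bit > 0 else 0)
    return word


def hamming(a: int, b: int) -> int:
    """Hamming distance between two packed codes: number of set bits in the XOR."""
    return bin(a ^ b).count("1")


def hamming_knn(query: int, database: List[int], k: int) -> List[int]:
    """Indices of the k database codes closest to the query (plain linear scan)."""
    order = sorted(range(len(database)), key=lambda n: hamming(query, database[n]))
    return order[:k]


# Toy database of three 4-bit codes and one query.
codes = [pack_code(c) for c in ([1, 1, -1, -1], [1, -1, -1, -1], [-1, -1, 1, 1])]
query = pack_code([1, 1, -1, 1])
nearest = hamming_knn(query, codes, k=2)
```

The scan is still linear in $N$, but each comparison touches only $b$ bits instead of $D$ floating-point values.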

The disadvantage is that the results are inexact, since the neighbors in the binary space will not be 
identical to the neighbors in the original space. However, the approximation error can be controlled by using 
sufficiently many bits and by learning a good hash function. This has been the topic of much work in recent 
years. The general approach consists of defining a supervised objective that has a small value for good hash 
functions and minimizing it. Ideally, such an objective function should be minimal when the neighbors of 
any given image are the same in both original and binary spaces. Practically in information retrieval, this 
is often evaluated using precision and recall. However, this ideal objective cannot be easily optimized over 
hash functions, and one uses approximate objectives instead. Many such objectives have been proposed in 
the literature. We focus here on affinity-based loss functions, which directly try to preserve the original
similarities in the binary space. Specifically, we consider objective functions of the form 

$$\min_{\mathbf{h}} \mathcal{L}(\mathbf{h}) = \sum_{n,m=1}^{N} L(\mathbf{h}(\mathbf{x}_n), \mathbf{h}(\mathbf{x}_m); y_{nm}) \qquad (1)$$


where $\mathbf{X} = (\mathbf{x}_1,\dots,\mathbf{x}_N)$ is the high-dimensional dataset of feature vectors, $\min_{\mathbf{h}}$ means minimizing over
the parameters of the hash function $\mathbf{h}$ (e.g. over the weights of a linear SVM), and $L(\cdot)$ is a loss function
that compares the codes for two images (often through their Hamming distance $\|\mathbf{h}(\mathbf{x}_n) - \mathbf{h}(\mathbf{x}_m)\|$) with
the ground-truth value $y_{nm}$ that measures the affinity in the original space between the two images $\mathbf{x}_n$
and $\mathbf{x}_m$ (distance, similarity or other measure of neighborhood; Grauman and Fergus, 2013). The sum
is often restricted to a subset of image pairs (n, m) (for example, within the k nearest neighbors of each 
other in the original space), to keep the runtime low. Examples of these objective functions (described 
below) include models developed for dimension reduction, be they spectral such as Laplacian Eigenmaps
(Belkin and Niyogi, 2003) and Locally Linear Embedding (Roweis and Saul, 2000), or nonlinear such as
the Elastic Embedding (Carreira-Perpiñán, 2010) or t-SNE (van der Maaten and Hinton, 2008); as well as
objective functions designed specifically for binary hashing, such as Supervised Hashing with Kernels (KSH)
(Liu et al., 2012), Binary Reconstructive Embeddings (BRE) (Kulis and Darrell, 2009) or Semi-supervised
sequential Projection Learning Hashing (SPLH) (Wang et al., 2012).

If the hash function $\mathbf{h}$ were a continuous function of its input $\mathbf{x}$ and its parameters, one could simply
apply the chain rule to compute derivatives over the parameters of $\mathbf{h}$ of the objective function (1) and then
apply a nonlinear optimization method such as gradient descent. This would be guaranteed to converge to
an optimum under mild conditions (for example, Wolfe conditions on the line search), which would be global
if the objective is convex and local otherwise (Nocedal and Wright, 2006). Hence, optimally learning the
function h would be in principle doable (up to local optima), although it would still be slow because the 
objective can be quite nonlinear and involve many terms. 

In binary hashing, the optimization is much more difficult, because in addition to the previous issues, 
the hash function must output binary values, hence the problem is not just generally nonconvex, but also 
nonsmooth. In view of this, much work has sidestepped the issue and settled on a simple but suboptimal 
solution. First, one defines the objective function (1) directly on the $b$-dimensional codes of each image
(rather than on the hash function parameters) and optimizes it assuming continuous codes (in $\mathbb{R}^b$). Then,
one binarizes the codes for each image. Finally, one learns a hash function given the codes. Optimizing the
affinity-based loss function (1) can be done using spectral methods or nonlinear optimization as described
above. Binarizing the codes has been done in different ways, from simply rounding them to $\{-1,+1\}$
using zero as threshold (Weiss et al., 2009; Zhang et al., 2010; Liu et al., 2011, 2012), to optimally finding
a threshold (Liu et al., 2011; Strecha et al., 2012), to rotating the continuous codes so that thresholding
introduces less error (Yu and Shi, 2003; Gong et al., 2013). Finally, learning the hash function for each
of the $b$ output bits can be considered as a binary classification problem, where the resulting classifiers
collectively give the desired hash function, and can be solved using various machine learning techniques.
Several works (e.g. Zhang et al., 2010; Lin et al., 2013, 2014) have used this approach, which does produce
reasonable hash functions (in terms of retrieval measures such as precision and recall).

In order to do better, one needs to take into account during the optimization (rather than after the 
optimization) the fact that the codes are constrained to be binary. This implies attempting directly the 
discrete optimization of the affinity-based loss function over binary codes. This is a daunting task, since this 
is usually an NP-complete problem with Nb binary variables altogether, and practical applications could 
make this number as large as millions or beyond. Recent works have applied alternating optimization (with 
various refinements) to this, where one optimizes over a usually small subset of binary variables given fixed
values for the remaining ones (Lin et al., 2013, 2014), and this did result in very competitive precision/recall
compared with the state-of-the-art. This is still slow and future work will likely improve it, but as of now it
provides an option to learn better binary codes. 

Of the three-step suboptimal approach mentioned (learn continuous codes, binarize them, learn hash 
function), these works manage to join the first two steps and hence learn binary codes. Then, one learns the 
hash function given these binary codes. Can we do better? Indeed, in this paper we show that all elements 
of the problem (binary codes and hash function) can be incorporated in a single algorithm that optimizes 
jointly over them. Hence, by initializing it from binary codes from the previous approach, this algorithm is
guaranteed to achieve a lower error and learn better hash functions. In fact, our framework can be seen as 
an iterated, corrected version of the two-step approach: learn binary codes given the current hash function, 
learn hash functions given codes, iterate (note the emphasis). The key to achieve this in a principled way 
is to use a recently proposed method of auxiliary coordinates (MAC) for optimizing “nested” systems, i.e., 
consisting of the composition of two or more functions or processing stages. MAC introduces new variables 
and constraints that cause decoupling between the stages, resulting in the mentioned alternation between 
learning the hash function and learning the binary codes. Section 2 reviews affinity-based loss functions,
section 3 describes our MAC-based proposed framework, section 4 evaluates it in several supervised datasets,
using linear and nonlinear hash functions, and section 5 discusses implications of this work.


Related work Although one can construct hash functions without training data (Andoni and Indyk, 2008;
Kulis and Grauman, 2012), we focus on methods that learn the hash function given a training set, since they
perform better, and our emphasis is on optimization. The learning can be unsupervised, which attempts to
preserve distances in the original space, or supervised, which in addition attempts to preserve label similarity.
Many objective functions have been proposed to achieve this and we focus on affinity-based ones. These
create an affinity matrix for a subset of training points based on their distances (unsupervised) or labels
(supervised) and combine it with a loss function (Liu et al., 2012; Kulis and Darrell, 2009; Norouzi and Fleet,
2011; Lin et al., 2013, 2014). Some methods optimize this directly over the hash function. For example,
Binary Reconstructive Embeddings (Kulis and Darrell, 2009) use alternating optimization over the weights
of the hash functions. Supervised Hashing with Kernels (Liu et al., 2012) learns hash functions sequentially
by considering the difference between the inner product of the codes and the corresponding element of the
affinity matrix. Although many approaches exist, a common theme is to apply a greedy approach where one
first finds codes using an affinity-based loss function, and then fits the hash functions to them (usually by
training a classifier). The codes can be found by relaxing the problem and binarizing its solution (Weiss et al.,
2009; Zhang et al., 2010; Liu et al., 2011), or by approximately solving for the binary codes using some form
of alternating optimization (possibly combined with GraphCut), as in two-step hashing (Lin et al., 2013,
2014; Ge et al., 2014), or by using relaxation in other ways (Liu et al., 2012; Norouzi and Fleet, 2011).


2 Nonlinear embedding and affinity-based loss functions for binary hashing

The dimensionality reduction literature has developed a number of objective functions of the form (1) (often
called “embeddings”) where the low-dimensional projection $\mathbf{z}_n \in \mathbb{R}^b$ of each high-dimensional data point
$\mathbf{x}_n \in \mathbb{R}^D$ is a free, real-valued parameter. The neighborhood information is encoded in the $y_{nm}$ values (using
labels in supervised problems, or distance-based affinities in unsupervised problems). A representative
example is the elastic embedding (Carreira-Perpiñán, 2010), where $L(\mathbf{z}_n,\mathbf{z}_m; y_{nm})$ has the form:


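$$y^{+}_{nm} \|\mathbf{z}_n - \mathbf{z}_m\|^2 + \lambda\, y^{-}_{nm} \exp\bigl(-\|\mathbf{z}_n - \mathbf{z}_m\|^2\bigr), \qquad \lambda > 0 \qquad (2)$$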


where the first term tries to project true neighbors (having $y^{+}_{nm} > 0$) close together, while the second repels
all non-neighbors’ projections (having $y^{-}_{nm} > 0$) from each other. Laplacian Eigenmaps (Belkin and Niyogi,
2003) and Locally Linear Embedding (Roweis and Saul, 2000) result from replacing the second term above
with a constraint that fixes the scale of $\mathbf{Z}$, which results in an eigenproblem rather than a nonlinear optimization,
but also produces more distorted embeddings. Other objectives exist, such as t-SNE (van der Maaten and Hinton,
2008), that do not separate into functions of pairs of points. Optimizing nonlinear embeddings is quite challenging,
but much progress has been made recently (Carreira-Perpiñán, 2010; Vladymyrov and Carreira-Perpiñán,
2012; van der Maaten, 2013; Yang et al., 2013; Vladymyrov and Carreira-Perpiñán, 2014). Although these
models were developed to produce continuous projections, they have been successfully used for binary hashing
too, by truncating their codes (Weiss et al., 2009; Zhang et al., 2010) or using the two-step approach of
Lin et al. (2013, 2014).

Other loss functions have been developed specifically for hashing, where now $\mathbf{z}_n$ is a $b$-bit vector (where
binary values are in $\{-1,+1\}$). For example (see a longer list in Lin et al., 2013), for Supervised Hashing
with Kernels (KSH) $L(\mathbf{z}_n,\mathbf{z}_m; y_{nm})$ has the form

$$(\mathbf{z}_n^T \mathbf{z}_m - b\, y_{nm})^2 \qquad (3)$$

where $y_{nm}$ is 1 if $\mathbf{x}_n$, $\mathbf{x}_m$ are similar and $-1$ if they are dissimilar. Binary Reconstructive Embeddings
(Kulis and Darrell, 2009) uses $(\frac{1}{b}\|\mathbf{z}_n - \mathbf{z}_m\|^2 - y_{nm})^2$ where $y_{nm} = \frac{1}{2}\|\mathbf{x}_n - \mathbf{x}_m\|^2$. The exponential variant
of SPLH (Wang et al., 2012) proposed by Lin et al. (2013) (which we call eSPLH) uses $\exp(-\frac{1}{b} y_{nm} \mathbf{z}_n^T \mathbf{z}_m)$.
Our approach can be applied to any of these loss functions, though we will mostly focus on the KSH loss for
simplicity. When the variables $\mathbf{Z}$ are binary, we will call these optimization problems binary embeddings, in
analogy to the more traditional continuous embeddings for dimension reduction.
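
As an illustration of how these pairwise losses act on codes in $\{-1,+1\}^b$, here is a minimal sketch. It assumes the usual forms of the three losses: KSH as $(\mathbf{z}_n^T\mathbf{z}_m - b\,y_{nm})^2$, BRE as $(\frac{1}{b}\|\mathbf{z}_n - \mathbf{z}_m\|^2 - y_{nm})^2$ and eSPLH as $\exp(-\frac{1}{b} y_{nm}\mathbf{z}_n^T\mathbf{z}_m)$. The function names are hypothetical.

```python
import math
from typing import Sequence


def dot(zn: Sequence[int], zm: Sequence[int]) -> int:
    """Inner product of two codes with entries in {-1,+1}."""
    return sum(p * q for p, q in zip(zn, zm))


def ksh_loss(zn: Sequence[int], zm: Sequence[int], y: float) -> float:
    """KSH: squared gap between the code inner product and b * y."""
    return (dot(zn, zm) - len(zn) * y) ** 2


def bre_loss(zn: Sequence[int], zm: Sequence[int], y: float) -> float:
    """BRE: squared gap between the scaled code distance and the target distance y."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(zn, zm))
    return (sq_dist / len(zn) - y) ** 2


def esplh_loss(zn: Sequence[int], zm: Sequence[int], y: float) -> float:
    """eSPLH: exponential of minus the label-weighted, scaled inner product."""
    return math.exp(-y * dot(zn, zm) / len(zn))


# Identical codes for a similar pair (y = +1) give zero KSH loss,
# and so do opposite codes for a dissimilar pair (y = -1).
same = ksh_loss([1, 1, -1], [1, 1, -1], 1)
opposite = ksh_loss([1, 1, -1], [-1, -1, 1], -1)
```

All three depend on the codes only through their inner product or, equivalently, their Hamming distance.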


3 Learning codes and hash functions using auxiliary coordinates 

The optimization of the loss $\mathcal{L}(\mathbf{h})$ in eq. (1) is difficult because of the thresholded hash function, which
appears as the argument of the loss function $L$. We use the recently proposed method of auxiliary coordinates
(MAC) (Carreira-Perpiñán and Wang, 2012, 2014), which is a meta-algorithm to construct optimization
algorithms for nested functions. This proceeds in 3 stages. First, we introduce new variables (the “auxiliary
coordinates”) as equality constraints into the problem, with the goal of unnesting the function. We can
achieve this by introducing one binary vector $\mathbf{z}_n \in \{-1,+1\}^b$ for each point. This transforms the original,
unconstrained problem into the following, constrained problem:


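$$\min_{\mathbf{h},\mathbf{Z}} \sum_{n,m=1}^{N} L(\mathbf{z}_n, \mathbf{z}_m; y_{nm}) \quad \text{s.t.} \quad \mathbf{z}_1 = \mathbf{h}(\mathbf{x}_1),\ \dots,\ \mathbf{z}_N = \mathbf{h}(\mathbf{x}_N) \qquad (4)$$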


which is seen to be equivalent to (1) by eliminating $\mathbf{Z}$. We recognize as the objective function the “embedding”
form of the loss function, except that the “free” parameters $\mathbf{z}_n$ are in fact constrained to be the deterministic
outputs of the hash function $\mathbf{h}$.

Second, we solve the constrained problem using a penalty method, such as the quadratic-penalty or
augmented Lagrangian (Nocedal and Wright, 2006). We discuss here the former for simplicity. We solve the
following minimization problem (unconstrained again, but dependent on $\mu$) while progressively increasing $\mu$,
so the constraints are eventually satisfied:


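$$\min_{\mathbf{h},\mathbf{Z}} \mathcal{L}_P(\mathbf{h},\mathbf{Z};\mu) = \sum_{n,m=1}^{N} L(\mathbf{z}_n,\mathbf{z}_m; y_{nm}) + \mu \sum_{n=1}^{N} \|\mathbf{z}_n - \mathbf{h}(\mathbf{x}_n)\|^2 \quad \text{s.t.} \quad \mathbf{z}_1,\dots,\mathbf{z}_N \in \{-1,+1\}^b \qquad (5)$$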


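The quadratic penalty $\|\mathbf{z}_n - \mathbf{h}(\mathbf{x}_n)\|^2$ is proportional to the Hamming distance between the binary vectors
$\mathbf{z}_n$ and $\mathbf{h}(\mathbf{x}_n)$, since each differing bit contributes $(\pm 2)^2 = 4$ and each matching bit contributes 0.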

Third, we apply alternating optimization over the binary codes Z and the hash function parameters h. 
This results in iterating the following two steps (described in detail later): 

• Optimize the binary codes $\mathbf{z}_1,\dots,\mathbf{z}_N$ given $\mathbf{h}$ (hence, given the output binary codes $\mathbf{h}(\mathbf{x}_1),\dots,\mathbf{h}(\mathbf{x}_N)$
for each of the $N$ images). This can be seen as a regularized binary embedding, because the projections
$\mathbf{Z}$ are encouraged to be close to the hash function outputs $\mathbf{h}(\mathbf{X})$. Here, we try two different approaches
(Lin et al., 2013, 2014) with some modifications.


• Optimize the hash function h given binary codes Z. This reduces to training b binary classifiers using 
X as inputs and Z as targets. 


This is very similar to the two-step (TSH) approach of Lin et al. (2013), except that the latter learns the
codes $\mathbf{Z}$ in isolation, rather than given the current hash function, so iterating the two-step approach would
change nothing, and it does not optimize the loss $\mathcal{L}$. More precisely, TSH corresponds to optimizing $\mathcal{L}_P$ for
$\mu \to 0^+$. In practice, we start from a very small value of $\mu$ (hence, initialize MAC from the result of TSH),
and increase $\mu$ slowly while optimizing $\mathcal{L}_P$, until the equality constraints are satisfied, i.e., $\mathbf{z}_n = \mathbf{h}(\mathbf{x}_n)$ for
$n = 1,\dots,N$.

Fig. 1 gives the overall MAC algorithm to learn a hash function by optimizing an affinity-based loss
function. We now describe the steps over $\mathbf{h}$ and $\mathbf{Z}$, and the path followed by the iterates as a function of $\mu$.


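input $\mathbf{X}_{D\times N} = (\mathbf{x}_1,\dots,\mathbf{x}_N)$, $\mathbf{Y}_{N\times N} = (y_{nm})$, $b \in \mathbb{N}$
Initialize $\mathbf{Z}_{b\times N} = (\mathbf{z}_1,\dots,\mathbf{z}_N) \in \{-1,+1\}^{bN}$
for $\mu = 0 < \mu_1 < \dots < \mu_\infty$
    for $i = 1,\dots,b$                                      (h step)
        $h_i \leftarrow$ fit hash function to $(\mathbf{X}, \mathbf{Z}_{\cdot i})$
    repeat                                                   (Z step)
        for $i = 1,\dots,b$
            $\mathbf{Z}_{\cdot i} \leftarrow$ approximate minimizer of $\mathcal{L}_P(\mathbf{h},\mathbf{Z};\mu)$ over $\mathbf{Z}_{\cdot i}$
    until no change in $\mathbf{Z}$ or maxit cycles ran
    if $\mathbf{Z} = \mathbf{h}(\mathbf{X})$ then stop
return $\mathbf{h}$, $\mathbf{Z} = \mathbf{h}(\mathbf{X})$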


Figure 1: MAC algorithm to optimize an affinity-based loss function for binary hashing. 
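
The loop of Fig. 1 can be sketched in a few lines. This is a toy illustration under loud assumptions, not the implementation used in the paper: it uses the KSH loss, a perceptron stands in for the per-bit SVM in the h step, and greedy single-bit flips stand in for the quadratic surrogate and GraphCut methods in the Z step. All names (`fit_bit`, `z_step`, `mac`, and the default values of `mu1`, `a`, `steps`) are hypothetical.

```python
from typing import List, Tuple

Vec = List[float]
Code = List[int]


def sign(v: float) -> int:
    return 1 if v >= 0 else -1


def predict_bit(w: Vec, x: Vec) -> int:
    # Linear one-bit hash function; the bias is stored as the last weight.
    return sign(sum(wi * xi for wi, xi in zip(w, x + [1.0])))


def fit_bit(X: List[Vec], targets: List[int], epochs: int = 50) -> Vec:
    # h step for one bit (assumption: perceptron instead of an SVM).
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, t in zip(X, targets):
            if predict_bit(w, x) != t:
                w = [wi + t * xi for wi, xi in zip(w, x + [1.0])]
    return w


def penalized_loss(Z: List[Code], H: List[Code], Y: List[List[int]], mu: float) -> float:
    # KSH loss over all pairs plus mu times the quadratic penalty of eq. (5).
    b = len(Z[0])
    fit = sum((sum(p * q for p, q in zip(zn, zm)) - b * Y[n][m]) ** 2
              for n, zn in enumerate(Z) for m, zm in enumerate(Z))
    penalty = sum((p - q) ** 2 for zn, hn in zip(Z, H) for p, q in zip(zn, hn))
    return fit + mu * penalty


def z_step(Z: List[Code], H: List[Code], Y: List[List[int]], mu: float, maxit: int = 5) -> None:
    # Z step (assumption: greedy bit flips, kept only if the loss decreases).
    for _ in range(maxit):
        changed = False
        for i in range(len(Z[0])):
            for n in range(len(Z)):
                before = penalized_loss(Z, H, Y, mu)
                Z[n][i] = -Z[n][i]
                if penalized_loss(Z, H, Y, mu) < before:
                    changed = True
                else:
                    Z[n][i] = -Z[n][i]  # undo the flip
        if not changed:
            break


def mac(X: List[Vec], Y: List[List[int]], Z0: List[Code],
        mu1: float = 0.1, a: float = 1.4, steps: int = 30) -> Tuple[List[Vec], List[Code]]:
    Z = [list(z) for z in Z0]
    b = len(Z[0])
    W: List[Vec] = []
    for mu in [0.0] + [mu1 * a ** k for k in range(steps)]:
        W = [fit_bit(X, [z[i] for z in Z]) for i in range(b)]   # h step
        H = [[predict_bit(w, x) for w in W] for x in X]
        z_step(Z, H, Y, mu)                                      # Z step
        if Z == H:                                               # constraints satisfied
            break
    return W, [[predict_bit(w, x) for w in W] for x in X]


# Toy problem: two well-separated classes in the plane, 2-bit codes.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
Y = [[1 if p == q else -1 for q in labels] for p in labels]
Z0 = [[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1], [-1, 1]]
W, codes = mac(X, Y, Z0)
```

Because flips are accepted only when they strictly decrease the penalized loss, each Z step in this sketch is monotone, which mirrors the skip rule discussed later for the quadratic surrogate method.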


3.1 Stopping criterion, schedule over $\mu$ and path of optimal values

It is possible to prove that once $\mathbf{Z} = \mathbf{h}(\mathbf{X})$ after a Z step (regardless of the value of $\mu$), the MAC algorithm
will make no further changes to $\mathbf{Z}$ or $\mathbf{h}$, since then the constraints are satisfied. This gives us a reliable
stopping criterion that is easy to check, and the MAC algorithm will stop after a finite number of iterations
(see below).

It is also possible to prove that the path of minimizers of $\mathcal{L}_P$ over the continuous penalty parameter
$\mu \in [0,\infty)$ is in fact discrete, with changes to $(\mathbf{Z},\mathbf{h})$ happening only at a finite number of values
$0 < \mu_1 < \dots < \mu_\infty < \infty$. Based on this and on our practical experience, we have found that the following approach
leads to good schedules for $\mu$ with little effort. We use exponential schedules, of the form $\mu_i = \mu_1 a^{i-1}$
for $i = 1,2,\dots$, so the user has to set only two parameters: the initial $\mu_1$ and the multiplier $a > 1$. We
choose exponential schedules because typically the algorithm makes most progress at the beginning, and it is
important to track a good minimum there. The upper value $\mu_\infty$ past which no changes occur will be reached
by our exponential schedule in a finite number of iterations, and our stopping criterion will detect that. We
set the multiplier to a value $1 < a < 2$ that is as small as computationally convenient. If $a$ is too small, the
algorithm will take many iterations, some of which may not even change $\mathbf{Z}$ or $\mathbf{h}$ (because the path of minima
is discrete). If $a$ is too big, the algorithm will reach a stopping point too quickly, without having had time
to find a better minimum. As for the initial $\mu_1$, we estimate it by trying values (exponentially spaced) until
we find a $\mu$ for which changes to $\mathbf{Z}$ from its initial value (for $\mu = 0$) start to occur. (It is also possible to find
lower and upper bounds for $\mu_1$ and $\mu_\infty$, respectively, for a particular loss function, such as KSH, eSPLH or
EE.) Overall, the computational time required to estimate $\mu_1$ and $a$ is comparable to running a few extra
iterations of the MAC algorithm.
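
A minimal sketch of the exponential schedule $\mu_i = \mu_1 a^{i-1}$ and of the search for the initial $\mu_1$ follows. The callback `changes_codes` is hypothetical: it stands for "run a Z step at this $\mu$ and report whether $\mathbf{Z}$ moved away from its $\mu = 0$ value", and the starting value and spacing factor are illustrative assumptions.

```python
from typing import Callable, Iterator


def mu_schedule(mu1: float, a: float, steps: int) -> Iterator[float]:
    """Yield mu_i = mu1 * a**(i-1) for i = 1, ..., steps."""
    if mu1 <= 0 or a <= 1:
        raise ValueError("need mu1 > 0 and a > 1")
    mu = mu1
    for _ in range(steps):
        yield mu
        mu *= a


def estimate_mu1(changes_codes: Callable[[float], bool],
                 start: float = 1e-6, factor: float = 10.0, max_tries: int = 20) -> float:
    """Try exponentially spaced values until the codes start to change."""
    mu = start
    for _ in range(max_tries):
        if changes_codes(mu):
            return mu
        mu *= factor
    return mu


first_three = list(mu_schedule(0.3, 1.4, 3))
# Stand-in predicate: pretend the codes start to change once mu exceeds 5e-4.
mu1_found = estimate_mu1(lambda mu: mu > 5e-4)
```

The schedule is open-ended in practice: it is the stopping criterion $\mathbf{Z} = \mathbf{h}(\mathbf{X})$, not a fixed number of steps, that ends the run.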

Finally, in practice we use a form of early stopping in order to improve generalization. We use a small
validation set to evaluate the precision achieved by the hash function $\mathbf{h}$ along the MAC optimization. If
the precision decreases over that of the previous step, we ignore the step and skip to the next value of $\mu$.
Besides helping to avoid overfitting, this saves computation, by avoiding such extra optimization steps. Since
the validation set is small, it provides a noisy estimate of the generalization ability at the current iterate, and
this occasionally leads to skipping a valid $\mu$ value. This is not a problem because the next $\mu$ value, which is
close to the one we skipped, will likely work. At some point during the MAC optimization, we do reach an
overfitting region and the precision stops increasing, so the algorithm will skip all remaining $\mu$ values until
it stops. In summary, using this validation procedure guarantees that the precision (in the validation set) is
greater than or equal to that of the initial $\mathbf{Z}$, thus resulting in a better hash function.
3.2 h step

Given the binary codes $\mathbf{z}_1,\dots,\mathbf{z}_N$, since $\mathbf{h}$ does not appear in the first term of $\mathcal{L}_P$, this simply involves
finding a hash function $\mathbf{h}$ that minimizes

$$\min_{\mathbf{h}} \sum_{n=1}^{N} \|\mathbf{z}_n - \mathbf{h}(\mathbf{x}_n)\|^2 = \sum_{i=1}^{b} \min_{h_i} \sum_{n=1}^{N} (z_{ni} - h_i(\mathbf{x}_n))^2$$

where $z_{ni} \in \{-1,+1\}$ is the $i$th bit of the binary vector $\mathbf{z}_n$. Hence, we can find $b$ one-bit hash functions in
parallel and concatenate them into the $b$-bit hash function. Each of these is a binary classification problem
using the number of misclassified patterns as loss. This allows us to use a regular classifier for $\mathbf{h}$, and even
to use a simpler surrogate loss (such as the hinge loss), since this will also enforce the constraints eventually
(as $\mu$ increases). For example, we can fit an SVM by optimizing the margin plus the slack and using a high
penalty for misclassified patterns. We discuss other classifiers in the experiments.
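
The separation into $b$ classification problems can be checked numerically with a small sketch: for codes in $\{-1,+1\}$, the squared penalty equals 4 times the total number of misclassified bits, counted bit by bit. The helper names and the toy arrays are hypothetical.

```python
from typing import List


def per_bit_errors(Z: List[List[int]], H: List[List[int]]) -> List[int]:
    """Number of points whose i-th target bit differs from the classifier output, for each bit i."""
    b = len(Z[0])
    return [sum(1 for zn, hn in zip(Z, H) if zn[i] != hn[i]) for i in range(b)]


def squared_penalty(Z: List[List[int]], H: List[List[int]]) -> int:
    """Sum over points of ||z_n - h(x_n)||^2."""
    return sum((p - q) ** 2 for zn, hn in zip(Z, H) for p, q in zip(zn, hn))


# Target codes Z (rows are points) and hypothetical hash outputs H.
Z = [[1, -1, 1], [-1, -1, 1], [1, 1, -1]]
H = [[1, 1, 1], [-1, -1, -1], [1, 1, -1]]
errors = per_bit_errors(Z, H)
penalty = squared_penalty(Z, H)
```

Here bit 0 is classified perfectly while bits 1 and 2 each have one error, so only those two classifiers have anything left to fix.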


3.3 Z step 


Although the MAC technique has significantly simplified the original problem, the step over $\mathbf{Z}$ is still complex.
This involves finding the binary codes given the hash function $\mathbf{h}$, and it is an NP-complete problem in $Nb$
binary variables. Fortunately, some recent works have proposed practical approaches for this problem based
on alternating optimization: a quadratic surrogate method (Lin et al., 2013), and a GraphCut method
(Lin et al., 2014). In both cases, this would correspond to the first step in the two-step hashing of Lin et al.
(2013).

In both the quadratic surrogate and the GraphCut method, the starting point is to apply alternating
optimization over the $i$th bit of all points given the remaining bits are fixed for all points (for $i = 1,\dots,b$),
and to solve the optimization over the $i$th bit approximately. We describe this next for each method. We
start by describing each method in its original form (which applies to the loss function over binary codes,
i.e., the first term in $\mathcal{L}_P$), and then we give our modification to make it work with our Z step objective (the
regularized loss function over binary codes, i.e., the complete $\mathcal{L}_P$).


Solution using a quadratic surrogate method (Lin et al., 2013) This is based on the fact that any
loss function that depends on the Hamming distance of two binary variables can be equivalently written as
a quadratic function of those two binary variables (Lin et al., 2013). Since this is the case for every term
$L(\mathbf{z}_n, \mathbf{z}_m; y_{nm})$ (because only the $i$th bit in each of $\mathbf{z}_n$ and $\mathbf{z}_m$ is free), we can write the first term in $\mathcal{L}_P$ as
a binary quadratic problem. We now consider the second term (on $\mu$) as well. (We use a similar notation to
that of Lin et al., 2013.) The optimization for the $i$th bit can be written as:


$$\min_{\mathbf{z}_{(i)}} \sum_{n,m=1}^{N} l_i(z_{ni}, z_{mi}) + \mu \sum_{n=1}^{N} (z_{ni} - h_i(\mathbf{x}_n))^2 \qquad (6)$$


where $l_i = L(z_{ni}, z_{mi}, \bar{\mathbf{z}}_n, \bar{\mathbf{z}}_m; y_{nm})$ is the loss function defined on the $i$th bit, $z_{ni}$ is the $i$th bit of the $n$th
point, $\bar{\mathbf{z}}_n$ is a vector containing the binary codes of the $n$th point except the $i$th bit, and $h_i(\mathbf{x}_n)$ is the $i$th bit
of the binary code of the $n$th point generated by the hash function $\mathbf{h}$. Lin et al. (2013) show that $l(z_1, z_2)$
can be replaced by a binary quadratic function


$$l(z_1,z_2) = \tfrac{1}{2} z_1 z_2 \bigl(l^{(11)} - l^{(-11)}\bigr) + \text{constant} \qquad (7)$$

as long as $l(1,1) = l(-1,-1) = l^{(11)}$ and $l(1,-1) = l(-1,1) = l^{(-11)}$, where $z_1, z_2 \in \{-1,1\}$. Equation (7)
helps us to rewrite the optimization (6) as the following:


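$$\min_{\mathbf{z}_{(i)}} \sum_{n,m=1}^{N} \tfrac{1}{2} z_{ni} z_{mi} \bigl(l^{(11)}_{inm} - l^{(-11)}_{inm}\bigr) + \mu \sum_{n=1}^{N} (z_{ni} - h_i(\mathbf{x}_n))^2 \qquad (8)$$

By defining $a_{nm} = l^{(11)}_{inm} - l^{(-11)}_{inm}$ as the $(n,m)$ element of a matrix $\mathbf{A} \in \mathbb{R}^{N\times N}$ and ignoring the
coefficients, we have the following optimization problem: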


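$$\min_{\mathbf{z}_{(i)}} \mathbf{z}_{(i)}^T \mathbf{A}\, \mathbf{z}_{(i)} + \mu \|\mathbf{z}_{(i)} - \mathbf{h}_i(\mathbf{X})\|^2 \quad \text{s.t.} \quad \mathbf{z}_{(i)} \in \{-1,+1\}^N$$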


where $\mathbf{h}_i(\mathbf{X}) = (h_i(\mathbf{x}_1),\dots,h_i(\mathbf{x}_N))^T$ is a vector of length $N$ (one bit per data point). Both terms in the
above minimization are quadratic on binary variables. This is still an NP-complete problem (except in special
cases), and we approximate it by relaxing it to a continuous quadratic program (QP) over $\mathbf{z}_{(i)} \in [-1,1]^N$ and
binarizing its solution. In general, the matrix $\mathbf{A}$ is not positive definite and the relaxed QP is not convex, so
we need an initialization. (However, the term on $\mu$ adds $\mu\mathbf{I}$ to $\mathbf{A}$, so even if $\mathbf{A}$ is not positive definite, $\mathbf{A} + \mu\mathbf{I}$
will be positive definite for large enough $\mu$, and the QP will be convex.) We construct an initialization by
converting the binary QP into a binary eigenproblem:


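$$\min_{\boldsymbol{\alpha}} \boldsymbol{\alpha}^T \mathbf{B}\, \boldsymbol{\alpha} \quad \text{s.t.} \quad \alpha_0 = 1,\ \mathbf{z}_{(i)} \in \{-1,1\}^N, \qquad \boldsymbol{\alpha} = \begin{pmatrix} \mathbf{z}_{(i)} \\ \alpha_0 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} \mathbf{A} & -\mu\,\mathbf{h}_i(\mathbf{X}) \\ -\mu\,\mathbf{h}_i(\mathbf{X})^T & 0 \end{pmatrix}. \qquad (9)$$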


To solve this problem we use spectral relaxation, where the constraints $\mathbf{z}_{(i)} \in \{-1,+1\}^N$ and $\alpha_0 = 1$ are
relaxed to $\|\boldsymbol{\alpha}\|^2 = N + 1$. The solution to this problem is the eigenvector corresponding to the smallest
eigenvalue of $\mathbf{B}$. We use the truncated eigenvector as the initialization for minimizing the relaxed,
bound-constrained QP:

$$\min_{\mathbf{z}_{(i)}} \mathbf{z}_{(i)}^T \mathbf{A}\, \mathbf{z}_{(i)} + \mu \|\mathbf{z}_{(i)} - \mathbf{h}_i(\mathbf{X})\|^2 \quad \text{s.t.} \quad \mathbf{z}_{(i)} \in [-1,1]^N$$

which we solve using L-BFGS-B (Zhu et al., 1997).

As noted above, the Z step is an NP-complete problem in general, so we cannot expect to find the global 
optimum. It is even possible that the approximate solution could increase the objective over the previous 
iteration’s Z (this is likely to happen as the overall MAC algorithm converges). If that occurs, we simply 
skip the update, in order to guarantee that we decrease monotonically on $\mathcal{L}_P$, and avoid oscillating around
a minimum. 
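
The quadratic surrogate of eq. (7) and the resulting coefficients $a_{nm}$ can be checked numerically for a concrete loss. The sketch below assumes the KSH loss in the form $(\mathbf{z}_n^T\mathbf{z}_m - b\,y_{nm})^2$, so that on the $i$th bit the loss is $(z_{ni}z_{mi} + r - b\,y_{nm})^2$ with $r$ the inner product of the remaining $b-1$ bits. The function names are hypothetical.

```python
from typing import List, Tuple


def ksh_bit_losses(r: int, y: int, b: int) -> Tuple[int, int]:
    """Return (l^(11), l^(-11)): KSH loss when the two free bits agree or disagree."""
    l_same = (1 + r - b * y) ** 2
    l_diff = (-1 + r - b * y) ** 2
    return l_same, l_diff


def surrogate(z1: int, z2: int, l_same: int, l_diff: int) -> float:
    """Eq. (7) with its constant made explicit: (l^(11) + l^(-11)) / 2."""
    return 0.5 * z1 * z2 * (l_same - l_diff) + 0.5 * (l_same + l_diff)


def coefficient_matrix(Z: List[List[int]], Y: List[List[int]], i: int) -> List[List[int]]:
    """a_nm = l^(11) - l^(-11) for bit i, holding the other bits of Z fixed."""
    b, N = len(Z[0]), len(Z)
    A = [[0] * N for _ in range(N)]
    for n in range(N):
        for m in range(N):
            r = sum(Z[n][j] * Z[m][j] for j in range(b) if j != i)
            l_same, l_diff = ksh_bit_losses(r, Y[n][m], b)
            A[n][m] = l_same - l_diff
    return A


l_same, l_diff = ksh_bit_losses(r=1, y=1, b=3)
A = coefficient_matrix([[1, 1], [1, -1]], [[1, -1], [-1, 1]], i=0)
```

Under this form of the loss, $a_{nm} = 4(r - b\,y_{nm})$, so the matrix $\mathbf{A}$ for one bit costs a single pass over the pairs.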


Solution using a GraphCut algorithm (Lin et al., 2014) To optimize over the $i$th bit (given all the
other bits are fixed), we have to minimize eq. (6). In general, this is an NP-complete problem over $N$ bits
(the $i$th bit for each image), with the form of a quadratic function on binary variables. We can apply the
GraphCut algorithm (Boykov and Kolmogorov, 2003, 2004; Kolmogorov and Zabih, 2003), as proposed by
the FastHash algorithm of Lin et al. (2014). This proceeds as follows. First, we assign all the data points
to different, possibly overlapping groups (blocks). Then, we minimize the objective function over the binary
codes of the same block, while all the other binary codes are fixed, then proceed with the next block, etc.
(that is, we do alternating optimization of the bits over the blocks). Specifically, to optimize over the bits
in block $B$, we define $a_{nm} = l^{(11)}_{inm} - l^{(-11)}_{inm}$ and, ignoring the constants, we can rewrite the binary
quadratic problem over $\mathbf{z}_{(i)}$ as:


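$$\min_{z_{ni},\, n \in B} \ \sum_{n \in B} \sum_{m \in B} a_{nm} z_{ni} z_{mi} + 2 \sum_{n \in B} \sum_{m \notin B} a_{nm} z_{ni} z_{mi} - 2\mu \sum_{n \in B} z_{ni} h_i(\mathbf{x}_n).$$

We then rewrite this equation in the standard form for the GraphCut algorithm:

$$\min_{z_{ni},\, n \in B} \ \sum_{n \in B} u_n z_{ni} + \sum_{n \in B} \sum_{m \in B} v_{nm} z_{ni} z_{mi}$$

where $v_{nm} = a_{nm}$ and $u_n = 2 \sum_{m \notin B} a_{nm} z_{mi} - 2\mu\, h_i(\mathbf{x}_n)$. To minimize the objective function using the GraphCut
algorithm, the blocks have to define a submodular function. For the objective functions that we explained
in the paper, this can be easily achieved by putting points with the same label in one block (Lin et al., 2014
give a simple proof of this).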

Unlike in the quadratic surrogate method, using the GraphCut algorithm with alternating optimization
on blocks defining submodular functions is guaranteed to find a $\mathbf{Z}$ that has a lower or equal objective value
than the initial one, and therefore to decrease $\mathcal{L}_P$ monotonically.
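
The block construction can be sketched as follows. The sketch relies on one fact about pairwise energies on $\pm 1$ variables: a term $a_{nm} z_n z_m$ is submodular exactly when $a_{nm} \le 0$, because the condition $E(-1,-1) + E(1,1) \le E(-1,1) + E(1,-1)$ reads $2a_{nm} \le -2a_{nm}$. The function names and the toy matrix are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def blocks_by_label(labels: Sequence[Hashable]) -> List[List[int]]:
    """Group point indices so that each block contains points of a single label."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for n, label in enumerate(labels):
        groups[label].append(n)
    return list(groups.values())


def is_submodular_block(A: List[List[float]], block: List[int]) -> bool:
    """True if every pairwise coefficient inside the block is non-positive."""
    return all(A[n][m] <= 0 for n in block for m in block if n != m)


labels = [0, 1, 0, 1]
# Toy coefficients: negative for same-label pairs, positive otherwise.
A = [[-4, 4, -4, 4],
     [4, -4, 4, -4],
     [-4, 4, -4, 4],
     [4, -4, 4, -4]]
blocks = blocks_by_label(labels)
```

In this toy matrix a block that mixes labels contains a positive coefficient, which is what makes it unsuitable for a direct graph cut.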
4 Experiments


We have tested our framework with several combinations of loss function, hash function, number of bits and
datasets, and compared it with several state-of-the-art hashing methods (appendix contains additional
experiments). We report a representative subset to show the flexibility of the approach. We use the KSH (3)
(Liu et al., 2012) and eSPLH (Wang et al., 2012) loss functions. We test quadratic surrogate and GraphCut
methods for the Z step in MAC. As hash functions (for each bit), we use linear SVMs (trained with
LIBLINEAR; Fan et al., 2008) and kernel SVMs.

We use the following labeled datasets (all using the Euclidean distance in feature space): (1) CIFAR
(Krizhevsky, 2009) contains 60 000 images in 10 classes. We use $D = 320$ GIST features (Oliva and Torralba,
2001) from each image. We use 58 000 images for training and 2 000 for test. (2) Infinite MNIST (Loosli et al.,
2007). We generated, using elastic deformations of the original MNIST handwritten digit dataset, 1 000 000
images for training and 2 000 for test, in 10 classes. We represent each image by a $D = 784$ vector of
raw pixels. Because of the computational cost of affinity-based methods, previous work has used training
sets limited to a few thousand points (Kulis and Darrell, 2009; Norouzi and Fleet, 2011; Liu et al., 2012;
Lin et al., 2013). We train the hash functions in a subset of 10 000 points of the training set, and report
precision and recall by searching for a test query on the entire dataset (the base set).

We report precision and precision/recall for the test set queries using as ground truth (set of true neighbors 
in original space) all the training points with the same label. In precision curves, the retrieved set contains 
the k nearest neighbors of the query point in the Hamming space. We report precision for different values 
of k to test the robustness of different algorithms. In precision/recall curves, the retrieved set contains the 
points inside Hamming distance r of the query point. These curves show the precision and recall at different 
Hamming distances r = 0 to r = L. We report zero precision when there is no neighbor inside Hamming 
distance r of a query. This happens most of the time when L is large and r is small. In most of our 
precision/recall curves, the precision drops significantly for very small and very large values of r. For small 
values of r, this happens because most of the query points do not retrieve any neighbor. For large values of 
r, this happens because the number of retrieved points becomes very large. 
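
The two retrieval measures described above can be sketched as follows, assuming codes stored as lists with entries in $\{-1,+1\}$ and "inside Hamming distance $r$" meaning distance at most $r$. The function names and the toy base set are hypothetical.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two codes differ."""
    return sum(1 for p, q in zip(a, b) if p != q)


def precision_at_k(query: Sequence[int], label: int, base_codes: List[List[int]],
                   base_labels: List[int], k: int) -> float:
    """Fraction of the k nearest base points (in Hamming space) that share the query label."""
    order = sorted(range(len(base_codes)), key=lambda n: hamming(query, base_codes[n]))
    return sum(1 for n in order[:k] if base_labels[n] == label) / k


def precision_recall_at_radius(query: Sequence[int], label: int, base_codes: List[List[int]],
                               base_labels: List[int], r: int) -> Tuple[float, float]:
    """Precision and recall of the points within Hamming distance r of the query."""
    retrieved = [n for n, code in enumerate(base_codes) if hamming(query, code) <= r]
    relevant = sum(1 for lab in base_labels if lab == label)
    hits = sum(1 for n in retrieved if base_labels[n] == label)
    precision = hits / len(retrieved) if retrieved else 0.0  # zero when nothing is retrieved
    recall = hits / relevant if relevant else 0.0
    return precision, recall


base_codes = [[1, 1, 1], [1, 1, -1], [-1, -1, -1], [-1, -1, 1]]
base_labels = [0, 0, 1, 1]
p_at_2 = precision_at_k([1, 1, 1], 0, base_codes, base_labels, k=2)
pr_r0 = precision_recall_at_radius([1, 1, 1], 0, base_codes, base_labels, r=0)
pr_empty = precision_recall_at_radius([1, -1, 1], 0, base_codes, base_labels, r=0)
```

The last call shows the convention used in the text: a query with no neighbor inside radius $r$ contributes zero precision.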

The main comparison points are the quadratic surrogate and GraphCut methods of Lin et al. (2013,
2014), which we denote in this section as quad and cut, respectively, regardless of the hash function that
fits the resulting codes. Correspondingly, we denote the MAC version of these as MACquad and MACcut,
respectively. We use the following schedule for the penalty parameter μ in the MAC algorithm (regardless
of the hash function type or dataset). We initialize Z with μ = 0, i.e., the result of quad or cut. Starting
from μ = 0.3 (MACcut) or 0.01 (MACquad), we multiply μ by 1.4 after each iteration (Z and h step).
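
As a small worked illustration of this schedule, the penalty values visited by the first few MAC iterations can be listed as below. The helper name `mu_schedule` and the choice of three iterations are only for the example.

```python
def mu_schedule(mu_start, factor=1.4, num_iterations=10):
    """Penalty value used at each MAC iteration (geometric growth by `factor`)."""
    return [mu_start * factor**i for i in range(num_iterations)]

mac_cut = mu_schedule(0.3, num_iterations=3)    # approximately 0.3, 0.42, 0.588
mac_quad = mu_schedule(0.01, num_iterations=3)  # approximately 0.01, 0.014, 0.0196
```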

Our experiments show that the MAC algorithm indeed finds hash functions with a significantly and 
consistently lower objective function value than rounding or two-step approaches (in particular, cut and 
quad); and that it outperforms other state-of-the-art algorithms on different datasets, with MACcut beating 
MACquad most of the time. The improvement in precision makes using MAC well worth the relatively small 
extra runtime and minimal additional implementation effort it requires. In all our plots, the vertical arrows 
indicate the improvement of MACcut over cut and of MACquad over quad. 


4.1 The MAC algorithm finds better optima 

The goal of this paper is not to introduce a new affinity-based loss or hash function, but to describe a generic
framework to construct algorithms that optimize a given combination thereof. We illustrate its effectiveness
here with the CIFAR dataset, with different sizes of retrieved neighbor sets, and using 16 to 48 bits. We
optimize two affinity-based loss functions (KSH from eq. (3) and eSPLH), and two hash functions (linear
and kernel SVM). In all cases, the MAC algorithm achieves a better hash function both in terms of the loss
and of the precision/recall. We compare 4 ways of optimizing the loss function: quad (Lin et al., 2013), cut
(Lin et al., 2014), MACquad and MACcut.

Footnote: To train a kernel SVM, we use 500 radial basis functions with centers given by a random subset of the training points, and
apply a linear SVM to their output. Computationally, this is fast because we can use a constant Gram matrix. Using as hash
function a kernel SVM trained with LIBSVM gave similar results, but is much slower because the support vectors change when
the labels change. We set the RBF bandwidth to the average Euclidean distance of the first 300 points.

Figure 2: KSH (top panel) and eSPLH (bottom panel) loss functions on CIFAR dataset, using b = 16 to 48
bits. The rows in each panel show the value of the loss function L, the precision for k retrieved points and
the precision/recall (at different Hamming distances).

For each point x_n in the training set we use κ+ = 100 positive (similar) and κ− = 500 negative
(dissimilar) neighbors, chosen at random to have the same or a different label as x_n, respectively. Fig. 2 (top
panel) shows the KSH loss function for all the methods (including the original KSH method in Liu et al.,
2012) over iterations of the MAC algorithm (KSH, quad and cut do not iterate), as well as precision and
recall. It is clear that MACcut (red lines) and MACquad (magenta lines) reduce the loss function more than
cut (blue lines) and quad (black lines), respectively, as well as the original KSH algorithm (cyan), in all cases:
type of hash function (linear: dashed lines, kernel: solid lines) and number of bits b = 16 to 48. Hence,
applying MAC is always beneficial. Reducing the loss nearly always translates into better precision and
recall (with a larger gain for linear than for kernel hash functions, usually). The gain of MACcut/MACquad
over cut/quad is significant, often comparable to the gain obtained by changing from the linear to the kernel
hash function within the same algorithm.
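
The construction of the label-based affinities described at the start of this paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the (n, m, y_nm) triple format, and the encoding y_nm = +1 for similar and −1 for dissimilar pairs are choices made for the example, not taken from the experimental code.

```python
import numpy as np

def sample_label_affinities(labels, n_pos, n_neg, seed=0):
    """Return (n, m, y_nm) triples: y_nm = +1 for same-label pairs, -1 otherwise."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    triples = []
    for n in range(len(labels)):
        same = np.flatnonzero(labels == labels[n])
        same = same[same != n]                      # exclude the point itself
        diff = np.flatnonzero(labels != labels[n])
        # Draw neighbors at random, without repetition, from each pool.
        pos = rng.choice(same, size=min(n_pos, len(same)), replace=False)
        neg = rng.choice(diff, size=min(n_neg, len(diff)), replace=False)
        triples += [(n, int(m), +1) for m in pos]
        triples += [(n, int(m), -1) for m in neg]
    return triples

toy_labels = [0, 0, 0, 1, 1, 1]
toy_triples = sample_label_affinities(toy_labels, n_pos=2, n_neg=2)
```

With the settings of this section one would call it with `n_pos=100` and `n_neg=500` on the 10 000 training labels, giving a sparse affinity structure with 600 entries per point.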

We usually find cut outperforms quad (in agreement with Lin et al., 2014), and correspondingly MACcut
outperforms MACquad. Interestingly, MACquad and MACcut end up being very similar even though they
started very differently. This suggests it is not crucial which of the two methods to use in the MAC Z step,
although we still prefer cut, because it usually produces somewhat better optima. Finally, fig. 2 (bottom
panel) shows the MACcut results using the eSPLH loss. All settings are as in the first KSH experiment. As
before, MACcut outperforms cut in both loss function and precision/recall using either a linear or a kernel
SVM.


4.2 Why does MAC learn better hash functions? 

In both the two-step and MAC approaches, the starting point is the “free” binary codes obtained by
minimizing the loss over the codes without them being the output of a particular hash function. That is,
minimizing (1) without the “z_n = h(x_n)” constraints:

\[ \min \; E(\mathbf{Z}) = \sum_{n,m=1}^{N} L(\mathbf{z}_n, \mathbf{z}_m; y_{nm}), \qquad \mathbf{z}_1, \dots, \mathbf{z}_N \in \{-1,+1\}^b \qquad (10) \]
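
To make the free-code objective concrete, the sketch below evaluates E(Z) for a given set of codes. It assumes the KSH pairwise loss in its usual form, L(z_n, z_m; y_nm) = (z_n^T z_m − b y_nm)^2, and a sparse list of (n, m, y_nm) triples; both the function name and the storage of Z with one code per row are assumptions for the example.

```python
import numpy as np

def ksh_free_code_loss(Z, triples):
    """E(Z) = sum over listed pairs of (z_n^T z_m - b*y_nm)^2, with Z in {-1,+1}^(N x b)."""
    b = Z.shape[1]
    total = 0.0
    for n, m, y in triples:
        total += (Z[n] @ Z[m] - b * y) ** 2
    return float(total)

# Points 0 and 1 are similar, points 0 and 2 are dissimilar.
toy_pairs = [(0, 1, +1), (0, 2, -1)]
good_codes = np.array([[1, 1], [1, 1], [-1, -1]])   # agrees with both affinities
bad_codes = np.array([[1, 1], [1, 1], [1, 1]])      # violates the dissimilar pair
```

Here `good_codes` attains zero loss, while `bad_codes` pays (2 + 2)^2 = 16 for the violated pair. Nothing in this computation refers to a hash function, which is why free codes can reach losses that no realizable h attains.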

The resulting free codes try to achieve good precision/recall independently of whether a hash function can
actually produce such codes. Constraining the codes to be realizable by a specific family of hash functions
(say, linear) means the loss E(Z) will be larger than for free codes. How difficult is it for a hash function
to produce the free codes? Fig. 3 plots the loss function for the free codes, the two-step codes from cut,
and the codes from MACcut, for both linear and kernel hash functions in the same experiment as in fig. 2.
It is clear that the free codes have a very low loss E(Z), which is far from what a kernel function can
produce, and even farther from what a linear function can produce. Both of these are relatively smooth
functions that cannot represent the presumably complex structure of the free codes. This could be improved
by using a very flexible hash function (e.g. using a kernel function with many centers), which could better
approximate the free codes, but 1) a very flexible function would likely not generalize well, and 2) we require
fast hash functions for fast retrieval anyway. Given our linear or kernel hash functions, what the two-step
cut optimization does is fit the hash function directly to the free codes. This is not guaranteed to find the
best hash function in terms of the original problem (1), and indeed it produces a pretty suboptimal function.
In contrast, MAC gradually optimizes both the codes and the hash function so they eventually match, and
finds a better hash function for the original problem (although it is still not guaranteed to find the globally
optimal function of problem (1), which is NP-complete).

Fig. 4 illustrates this conceptually. It shows the space of all possible binary codes, the contours of E(Z)
(green) and the set of codes that can be produced by (say) linear hash functions h (gray), which is the
feasible set {Z ∈ {−1,+1}^{b×N}: Z = h(X) for linear h}. The two-step codes “project” the free codes onto
the feasible set, but these are not the codes for the optimal hash function h.

4.3 Runtime 

The runtime per iteration for our 10 000-point training sets with b = 48 bits and κ+ = 100 and κ− = 500
neighbors on a laptop is 2 minutes for both MACcut and MACquad. They stop after 10-20 iterations. Each iteration
is comparable to a single cut or quad run, since the Z step dominates the computation. The iterations after 
the first one are faster because they are warm-started. 

Figure 3: Like fig. 2 but showing the value of the error function E(Z) of eq. (10) for the “free” binary codes,
and for the codes produced by the hash functions learned by cut (the two-step method) and MACcut, with
linear and kernel hash functions.


4.4 Comparison with binary hashing methods 

Fig. 5 shows results on CIFAR and Infinite MNIST. We create affinities y_nm for all methods using the
dataset labels as before, with κ+ = 100 similar neighbors and κ− = 500 dissimilar neighbors. We compare
MACquad and MACcut with Two-Step Hashing (quad) (Lin et al., 2013), FastHash (cut) (Lin et al., 2014),
Hashing with Kernels (KSH) (Liu et al., 2012), Iterative Quantization (ITQ) (Gong et al., 2013), Binary
Reconstructive Embeddings (BRE) (Kulis and Darrell, 2009) and Self-Taught Hashing (STH) (Zhang et al.,
2010). MACquad, MACcut, quad and cut all use the KSH loss function (3). The results show that MACcut
(and MACquad) generally outperform all other methods, often by a large margin, in nearly all situations
(dataset, number of bits, size of retrieved set). In particular, MACcut and MACquad are the only ones to
beat ITQ, as long as one uses sufficiently many bits.



Figure 4: Illustration of free codes, two-step codes and optimal codes realizable by a hash function, in the
space {−1,+1}^{b×N}.

Figure 5: Comparison with binary hashing methods on CIFAR (top panel) and Infinite MNIST (bottom
panel), using a linear hash function, using b = 16 to 64 bits. The rows in each panel show the precision for
k retrieved points, for a range of k, and the precision/recall at different Hamming distances.


5 Discussion 


Two-step approaches vs the MAC algorithm for affinity-based loss functions The two-step
approach of Two-Step Hashing (Lin et al., 2013) and FastHash (Lin et al., 2014) is a significant advance in
finding good codes for binary hashing, but it also causes a maladjustment between the codes and the hash
function, since the codes were learned without knowledge of what hash function would use them. Ignoring
the interaction between the loss and the hash function limits the quality of the results. For example, a linear
hash function will have a harder time than a nonlinear one at learning such codes. In our algorithm, this
tradeoff is enforced gradually (as μ increases) in the Z step as a regularization term (eq. (5)): it finds the
best codes according to the loss function, but makes sure they are close to being realizable by the current
hash function. Our experiments demonstrate that significant, consistent gains are achieved in both the loss
function value and the precision/recall in image retrieval over the two-step approach.


A similar, well-known situation arises in feature selection for classification (Kohavi and John, 1998). The
best combination of classifier and features will result from jointly minimizing the classification error with
respect to both classifier and features (the “wrapper” approach), rather than first selecting features according
to some criterion and then using them to learn a particular classifier (the “filter” approach). From this point
of view, the two-step approaches of (Lin et al., 2013, 2014) are filter approaches that first optimize the loss
function over the codes Z (equivalently, optimize L_P with μ = 0) and then fit the hash function h to those
codes. Any such filter approach is then equivalent to optimizing L_P over (Z, h) for μ → 0⁺.

Figure 6: As in fig. 5 but using the cosine similarity instead of the Euclidean distance to find neighbors (i.e.,
all the points are centered and normalized before training and testing), on CIFAR.

The method of auxiliary coordinates algorithmically decouples (within each iteration) the two elements 
that make up a binary hashing model: the hash function and the loss function. Both elements act in combination
to produce a function that maps input patterns to binary codes so that they represent neighborhood
in input space, but they play distinct roles. The hash function role is to map input patterns to binary codes. 
The loss function role is to assign binary codes to input patterns in order to preserve neighborhood relations, 
regardless of how easy it is for a mapping to produce such binary codes. By itself, the loss function would 
produce a nonparametric hash function for the training set with the form of a table of (image,code) pairs. 
However, the hash function and the loss function cannot act independently, because the objective function 
depends on both. The optimal combination of hash and loss is difficult to obtain, because of the nonlinear 
and discrete nature of the objective. Several previous optimization attempts for binary hashing first find 
codes that optimize the loss, and then fit a hash function to them, thus imposing a strict, suboptimal separation
between loss function and hash function. In MAC, both elements are decoupled within each iteration,
while still optimizing the correct objective: the step over the hash function does not involve the loss, and 
the step over the codes does not involve the hash function, but both are iterated. The connection between 
both steps occurs through the auxiliary coordinates, which are the binary codes themselves. The penalty 
regularizes the loss so that its optimal codes are progressively closer to what a hash function from the given 
class (e.g. linear) can achieve. 
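
The alternation described above can be sketched on a toy problem as follows. This is a simplified illustration, not the algorithm used in the experiments: it assumes the KSH pairwise loss (z_n^T z_m − b y_nm)^2 and a quadratic penalty μ Σ_n ||z_n − h(x_n)||^2, and it swaps in two stand-ins that are named plainly here: a least-squares linear classifier per bit instead of an SVM in the h step, and greedy single-bit flips instead of quad or cut in the Z step. All function names are hypothetical.

```python
import numpy as np

def ksh_loss(Z, pairs):
    """Affinity loss: sum of (z_n^T z_m - b*y_nm)^2 over the listed (n, m, y) triples."""
    b = Z.shape[1]
    return sum((Z[n] @ Z[m] - b * y) ** 2 for n, m, y in pairs)

def fit_linear_hash(X, Z):
    """h step (stand-in): one least-squares linear classifier per bit, not an SVM."""
    Xa = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
    W, *_ = np.linalg.lstsq(Xa, Z.astype(float), rcond=None)
    return W

def apply_linear_hash(W, X):
    """Binary codes in {-1,+1} produced by the linear hash function."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.where(Xa @ W >= 0, 1, -1)

def z_step(Z, H, pairs, mu):
    """Z step (stand-in): greedy single-bit flips on loss + mu * ||Z - H||^2."""
    def objective(Zc):
        return ksh_loss(Zc, pairs) + mu * np.sum((Zc - H) ** 2)
    Z = Z.copy()
    best = objective(Z)
    improved = True
    while improved:
        improved = False
        for n in range(Z.shape[0]):
            for i in range(Z.shape[1]):
                Z[n, i] *= -1                        # try flipping one bit
                trial = objective(Z)
                if trial < best:
                    best, improved = trial, True     # keep the flip
                else:
                    Z[n, i] *= -1                    # undo the flip
    return Z

def mac(X, pairs, Z_init, mu_start=0.3, factor=1.4, max_iters=60):
    """Alternate h and Z steps while increasing mu, until Z = h(X)."""
    Z, mu = Z_init.copy(), mu_start
    for _ in range(max_iters):
        W = fit_linear_hash(X, Z)                    # h step: does not involve the loss
        H = apply_linear_hash(W, X)
        Z = z_step(Z, H, pairs, mu)                  # Z step: h enters only through H
        if np.array_equal(Z, H):                     # constraints z_n = h(x_n) satisfied
            break
        mu *= factor
    return W, Z
```

In this sketch the h step only sees (X, Z) and the Z step only sees the fixed codes H = h(X), so either stand-in could be replaced by a different classifier or a different binary optimizer without touching the other.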

What is the best type of hash function to use? The answer to this is not unique, as it depends on 
application-specific factors: quality of the codes produced (to retrieve the correct images), time to compute 
the codes on high-dimensional data (since, after all, the reason to use binary hashing is to speed up retrieval), 
ease of implementation within a given hardware architecture and software libraries, etc. Our MAC framework
considerably facilitates this choice, because training different types of hash functions simply involves reusing
an existing classification algorithm within the h step, with no changes to the Z step.
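
As an illustration of how little the h step needs, the following sketch fits a kernel-type hash function on fixed radial basis function features, in the spirit of the kernel hash function used in the experiments. It is a sketch under assumptions: the Gaussian form of the RBF, the function names, and the use of ridge-regularized least squares in place of a linear SVM are choices made here to keep the example self-contained.

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian RBF features exp(-||x - c||^2 / (2 sigma^2)) for every point/center pair."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def fit_bits(G, Z, ridge=1e-3):
    """Fit one linear classifier per bit on fixed features G (ridge least squares stand-in)."""
    A = G.T @ G + ridge * np.eye(G.shape[1])
    return np.linalg.solve(A, G.T @ Z)

def hash_codes(G, W):
    """Codes in {-1,+1} obtained by thresholding the linear outputs at zero."""
    return np.where(G @ W >= 0, 1, -1)

# Two well-separated groups of points, one RBF center in each group.
X_toy = np.array([[-2, -2], [-2, -1.5], [-1.5, -2], [2, 2], [2, 1.5], [1.5, 2]], dtype=float)
G_toy = rbf_features(X_toy, X_toy[[0, 3]], sigma=1.0)   # computed once, reused every h step
Z_toy = np.array([[1], [1], [1], [-1], [-1], [-1]])
W_toy = fit_bits(G_toy, Z_toy)
```

Because the centers are fixed, `G_toy` does not change when the codes Z change between MAC iterations, so only the cheap linear fit is repeated.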

In terms of runtime, the resulting MAC algorithm is not much slower than the two-step approach; it is 
comparable to iterating the latter a few times. Besides, since all iterations except the first are warm-started, 
the average cost of one iteration is lower than for the two-step approach. 

Finally, note that the method of auxiliary coordinates can be used also to learn an out-of-sample
mapping for a continuous embedding (Carreira-Perpiñán and Vladymyrov, 2015), such as the elastic embedding
(Carreira-Perpiñán, 2010) or t-SNE (van der Maaten and Hinton, 2008), rather than to learn hash
functions for a discrete embedding, as is our case in binary hashing. The resulting MAC algorithm optimizes
over the out-of-sample mapping and the auxiliary coordinates (which are the data points’ low-dimensional
projections), by alternating two steps. One step optimizes the out-of-sample mapping that projects
high-dimensional points to the continuous, latent space, given the auxiliary coordinates Z. This is a regression
problem, while in binary hashing this is a classification problem (per hash function). The other step
optimizes the auxiliary coordinates Z given the mapping and is a regularized continuous embedding problem.
Both steps can be solved using existing algorithms. In particular, solving the Z step can be done efficiently
with large datasets by using N-body methods and efficient optimization techniques (Carreira-Perpiñán, 2010;
Vladymyrov and Carreira-Perpiñán, 2012; van der Maaten, 2013; Yang et al., 2013; Vladymyrov and
Carreira-Perpiñán, 2014). In binary hashing the Z step is a combinatorial optimization and, at present, far
more challenging to solve. However, with continuous embeddings one must drive the penalty parameter μ to
infinity for the constraints to be satisfied and so the solution follows a continuous path over μ ∈ ℝ, while with
binary hashing the solution follows a discretized, piecewise path which terminates at a finite value of μ.


Binary autoencoder vs affinity-based loss, trained with MAC The method of auxiliary coordinates
has also been applied in the context of binary hashing to a different objective function, the binary autoencoder
(BA) (Carreira-Perpiñán and Raziperchikolaei, 2015):

\[ L_{\text{BA}}(\mathbf{h}, \mathbf{f}) = \sum_{n=1}^{N} \| \mathbf{x}_n - \mathbf{f}(\mathbf{h}(\mathbf{x}_n)) \|^2 \qquad (11) \]

where h is the hash function, or encoder (which outputs binary values), and f is a decoder. (ITQ, Gong et al.,
2013, can be seen as a suboptimal way to optimize this.) As with the affinity-based loss function, the MAC
algorithm alternates between fitting the hash function (and the decoder) given the codes, and optimizing 
over the codes. However, in the binary autoencoder the optimization over the codes decouples over every data 
point (since the objective function involves one term per data point). This has an important computational 
advantage in the Z step: rather than having to solve one large optimization problem {z_1, ..., z_N} over Nb
binary variables, it has to solve N small optimization problems {z_1}, ..., {z_N} each over b variables, which
is much faster and easier to solve (since b is relatively small in practice), and to parallelize. Also, the BA 
objective does not require any neighborhood information (e.g. the affinity between pairs of neighbors) and 
scales linearly with the dataset. Computing these affinity values, or even finding pairs of neighbors in the 
first place, is computationally costly. For these reasons, the BA can scale to training on larger datasets than 
affinity-based loss functions. 
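
The per-point decoupling can be sketched as follows. This is an illustration under assumptions rather than the published BA algorithm: it assumes a linear decoder f(z) = A z + c, codes in {0,1}^b, a quadratic penalty μ ||z_n − h(x_n)||^2 tying the codes to the encoder output, and exhaustive enumeration of the 2^b candidates, which is only practical for small b. The function name and argument layout are hypothetical.

```python
import itertools
import numpy as np

def ba_z_step(X, H, A, c, mu):
    """Solve N independent b-bit problems by enumerating all 2^b candidate codes.

    X: (N, D) data, H: (N, b) encoder outputs h(x_n) in {0,1},
    A: (D, b) and c: (D,) define the assumed linear decoder f(z) = A z + c.
    """
    b = A.shape[1]
    candidates = np.array(list(itertools.product([0, 1], repeat=b)))   # (2^b, b)
    recon = candidates @ A.T + c                                       # decoder output per code
    Z = np.empty_like(H)
    for n in range(len(X)):          # each data point is its own small problem
        cost = ((X[n] - recon) ** 2).sum(axis=1) + mu * ((candidates - H[n]) ** 2).sum(axis=1)
        Z[n] = candidates[np.argmin(cost)]
    return Z

X_ba = np.array([[0.9, 0.1], [0.2, 0.8]])
H_ba = np.zeros((2, 2), dtype=int)
A_ba, c_ba = np.eye(2), np.zeros(2)
```

With μ = 0 each point simply picks the code whose reconstruction is closest; with a large μ the codes are pulled to the encoder output H. The loop over n has no dependence between iterations, which is what makes this step easy to parallelize, in contrast with the coupled problem of eq. (10).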

The BA objective function does have the disadvantage of being less directly related to the goals that are
desirable from an information retrieval point of view, such as precision and recall. Neighborhood relations are
only indirectly preserved by autoencoders (Carreira-Perpiñán and Raziperchikolaei, 2015), whose direct aim
is to reconstruct their inputs and thus to learn the data manifold (imperfectly, because of the binary projection
layer). Affinity-based loss functions of the form (1) allow the user to specify more complex neighborhood
relations, for example based on class labels, which may significantly differ from the actual distances in image 
feature space. Still, finding more efficient and scalable optimization methods for binary embeddings (in the 
Z step of the MAC algorithm), that are able to handle larger numbers of training and neighbor points, would 
improve the quality of the loss function. This is an important topic of future research. 


6 Conclusion 

We have proposed a general framework for optimizing binary hashing using affinity-based loss functions. It 
improves over previous, two-step approaches based on learning binary codes first and then learning the hash 
function. Instead, it optimizes jointly over the binary codes and the hash function in alternation, so that 
the binary codes eventually match the hash function, resulting in a better local optimum of the affinity- 
based loss. This was possible by introducing auxiliary variables that conditionally decouple the codes from 
the hash function, and gradually enforcing the corresponding constraints. Our framework makes it easy to 
design an optimization algorithm for a new choice of loss function or hash function: one simply reuses existing 
software that optimizes each in isolation. The resulting algorithm is not much slower than the suboptimal 
two-step approach—it is comparable to iterating the latter a few times—and well worth the improvement in 
precision/recall. 

The step over the hash function is essentially a solved problem if using a classifier, since this can be
learned in an accurate and scalable way using machine learning techniques. The most difficult and
time-consuming part in our approach is the optimization over the binary codes, which is NP-complete and involves
many binary variables and terms in the objective. Although some techniques exist (Lin et al., 2013, 2014)
that produce practical results, designing algorithms that reliably find good local optima and scale to large
training sets is an important topic of future research.

Another direction for future work involves learning more sophisticated hash functions that go beyond 
mapping image features onto output binary codes using simple classifiers such as SVMs. This is possible 
because the optimization over the hash function parameters is confined to the h step and takes the form 
of a supervised classification problem, so we can apply an array of techniques from machine learning and 
computer vision. For example, it may be possible to learn image features that work better with hashing than 
standard features such as SIFT, or to learn transformations of the input to which the binary codes should 
be invariant, such as translation, rotation or alignment. 


A Additional experiments 


A.1 Unsupervised dataset


Although affinity-based hashing is intended to work with supervised datasets, it can also be used with
unsupervised ones, and our MAC approach applies just as well. We use the SIFT1M dataset (Jégou et al.,
2011), which contains N = 1 000 000 training high-resolution color images and 10 000 test images, each
represented by D = 128 SIFT features. The experiments and conclusions are generally the same as with
supervised datasets, with small differences in the settings of the experiments. In order to construct an
affinity-based objective function, we define neighbors as follows. For each point in the training set we use the
κ+ = 100 nearest neighbors as positive (similar) neighbors, and κ− = 500 points chosen randomly among
the remaining points as negative (dissimilar) neighbors. We report precision and precision/recall for the
test set queries using as ground truth (set of true neighbors in original space) the K nearest neighbors in
unsupervised datasets, and all the training points with the same label in supervised datasets.

Fig. 7 shows results using KSH and eSPLH loss functions, respectively, with different sizes of retrieved
neighbor sets and using 8 to 32 bits. As with the supervised datasets, it is clear that the MAC algorithm
finds better optima and that MACcut is generally better than MACquad. Fig. 8 shows one case, using
κ+ = 50, κ− = 1 000 and K = 10 000 (1% of the base set), where quad outperforms cut and correspondingly
MACquad outperforms MACcut, although both MAC results are very close, particularly in precision and
recall.

Fig. 10 shows results comparing with binary hashing methods. All methods are trained on a subset of
5 000 points. We consider two types of methods. In the first type, we create pseudolabels for each point and
then apply supervised methods as in CIFAR (in particular, cut/quad and MACcut/MACquad, using the KSH
loss function). The pseudolabels y_nm for each training point x_n are obtained by declaring as similar points
its κ+ = 100 true nearest neighbors and as dissimilar points a random subset of κ− = 500 points among the
remaining points. In the second type, we use purely unsupervised methods (not based on similar/dissimilar
affinities): thresholded PCA (tPCA), Iterative Quantization (ITQ) (Gong et al., 2013), Binary Autoencoder
(BA) (Carreira-Perpiñán and Raziperchikolaei, 2015), Spectral Hashing (SH) (Weiss et al., 2009),
Anchor-Graph Hashing (AGH) (Liu et al., 2011), and Spherical Hashing (SPH) (Heo et al., 2012). The results are
again in general agreement with the conclusions in the main paper.


Figure 7: Like fig. 2 but on SIFT1M dataset, for the KSH (top panel) and eSPLH (bottom panel) loss
functions. The rows show the value of the loss function L, the precision (for a number of retrieved points k)
and the precision/recall (at different Hamming distances), using b = 8 to 32 bits.

Figure 8: Like the top panel of fig. 7 (KSH loss) but with κ+ = 50, κ− = 1 000, K = 10 000.

Comparison using code utilization Fig. 12 shows the results (for all methods on SIFT1M) in effective
number of bits b_eff. This is a measure of code utilization of a hash function introduced by Carreira-Perpiñán
and Raziperchikolaei (2015), defined as the entropy of the code distribution. That is, given the N codes
z_1, ..., z_N ∈ {0,1}^b for the training set, we consider them as samples of a distribution over the 2^b possible
codes. The entropy of this distribution, measured in bits, is between 0 (when all N codes are equal) and
min(b, log2 N) (when all N codes are distributed as uniformly as possible). We do the same for the test set.
Although code utilization correlates to some extent with precision/recall when ranking different methods, a
large b_eff does not guarantee a good hash function, and indeed, tPCA (which usually achieves a low precision
compared to the state-of-the-art) typically achieves the largest b_eff; see the discussion in Carreira-Perpiñán
and Raziperchikolaei (2015). However, a large b_eff does indicate a better use of the available codes (and fewer
collisions if N < 2^b), and b_eff has the advantage over precision/recall that it does not depend on any user
parameters (such as ground truth size or retrieved set size), so we can compare all binary hashing methods
with a single number b_eff (for a given number of bits b). It is particularly useful to compare methods that are
optimizing the same objective function. With this in mind, we can compare MACcut with cut and MACquad
with quad because these pairs of methods optimize the same objective function.
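
The effective number of bits can be computed directly from its definition as the entropy of the empirical code distribution. The sketch below is one way to do so; the function name and the row-per-code array layout are assumptions for the example.

```python
from collections import Counter
import numpy as np

def effective_bits(codes):
    """Entropy (in bits) of the empirical distribution of the rows of `codes`."""
    counts = np.array(list(Counter(map(tuple, np.asarray(codes))).values()), dtype=float)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

all_equal = np.zeros((8, 3), dtype=int)                               # every point gets the same code
all_distinct = np.array([[i >> 2 & 1, i >> 1 & 1, i & 1] for i in range(8)])
few_points = all_distinct[:4]                                         # N = 4 points, b = 3 bits
```

These three cases reproduce the bounds stated in the text: 0 bits when all codes coincide, b = 3 bits when the 8 codes are all used once, and log2 N = 2 bits when there are only N = 4 distinct points.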


Figure 9: Like fig. 3 but for the SIFT1M dataset (panels: KSH, eSPLH).

Figure 10: Comparison with binary hashing methods on SIFT1M using pseudolabels (top panel) and without
labels (bottom panel). The rows in each panel show the precision (for a range of retrieved points k) and the
precision/recall (at different Hamming distances), using b = 8 to 32 bits.

Figure 11: As in fig. 10 but using the cosine similarity instead of the Euclidean distance to find neighbors
(i.e., all the points are centered and normalized before training and testing), on SIFT1M.



Figure 12: Code utilization in effective number of bits b_eff (entropy of code distribution) of different hashing
algorithms, using b = 8 to 32 bits, for the SIFT1M dataset. The plots correspond to the codes obtained by
the algorithms in figure 10, with solid lines for the training set and dashed lines for the test set. The two
diagonal-horizontal black dotted lines give the upper bound (maximal code utilization) min(b, log2 N) on b_eff
of any algorithm for the training and test sets (where N is the size of the training or test set).